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Abstract.  We  describe  an  adaptive  procedure  that  approximates  a  function 
of  many  variables  by  a  sum  of  (univariate)  spline  functions  sm  of  selected 
linear  combinations  a  *  x  of  the  coordinates 
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The  procedure  is  nonlinear  in  that  not  only  the  spline  coefficients  but 
also  the  linear  combinations  are  optimized  for  the  particular  problem. 

The  sample  need  not  lie  on  a  regular  grid,  and  the  approximation  is  affine 
invariant,  smooth,  and  lends  itself  to  graphical  interpretation.  Function 
values,  derivatives,  and  integrals  are  inexpensive  to  evaluate. 
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1.  Introduction 

Multidimensional  surface  approximation  is  recognized  as  an  important 
problem  for  which  several  methodologies  have  been  developed.  The  aim  is 
to  construct  an  approximation  4>(x)  to  a  p-dimensional  surface  Y=  f  (x )  on 
the  basis  of  (possibly  noisy)  observations  {(y^.x^}^  .  .  Most  existing 

methods,  such  as  tensor  product  splines,  kernels,  and  thin  plate  splines 
(for  a  survey,  see  Schumaker  [1976]),  are  linear  ir  that 

*(x)  =  2  w-y., 

Isisn  1 

where  the  weights  {w^  depend  only  on  x  and  {x^}^^,  but  not  on 
*yi listen'  These  methods  have  the  advantage  that  they  are  straightforward 
to  compute  and  their  theory  is  tractable.  In  practice,  however,  they  are 
limited  because  they  cannot  take  advantage  of  special  properties  of  the 
surface.  Due  to  the  inherent  sparsity  of  high-dimensional  sampling, 
procedures  successful  in  high  dimensions  must  be  adaptive  and  thus  non- 
1  inear. 

In  this  paper  we  describe  an  adaptive  procedure  that  approximates 

f(x)  by  a  sum  of  (univariate)  spline  functions  sm  of  selected  linear 

combinations  a  •  x  of  the  coordinates 
m 

4>(x)  *  2  s  (a  '  x).  (1) 

IsmsM  m  m 

The  procedure  is  nonlinear  in  that  not  only  the  spline  coefficients  but 
also  the  linear  combinations  are  optimized  for  the  particular  problem. 
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2.  The  Algroithm 

The  spline  function  sm  along  am  •  x  is  represented  as  a  sum  of  j 
B-splines  [de  Boor,  1978]  of  order  q 


s  (am  •  x)  =  2  B„,-Bm.(am  •  x). 
m'  m  l^jsj  ra^  mj  m 


(2) 


The  approximation  $(x)  (given  by  equations  (1)  and  (2))  is  specified  by 
the  directions  the  knot  sequences  along  am  •  x  for  IsmsM,  and 

the  B-spline  coefficients  For  particular  { am} ,  the 

knots  are  placed  heuristically  and  then  the  are  de:*rmined  by 

(linear)  least  squares.  The  residual  sum  of  squares  from  this  fit  is 

taken  to  be  the  inverse  figure  of  merit  for  {a  },  ... 

3  in  IsmsrM 

Following  Friedman  and  Stuetzle  [1981],  the  approximation  is  con¬ 
structed  in  a  stepwise  manner:  given  ,  find  aM  to  optimize  the 

figure  of  merit  of  Terminate  when  the  figure  of  merit  is  below 

a  user  specified  threshold. 

3.  implementation 

A  difficult  part  of  the  algorithm  is  finding  each  direction  am.  We 
perform  a  numerical  search  using  a  Rosenbrock  method  [Rosenbrock,  1966]. 
This  method  is  easily  modifiable  to  search  over  the  unit  sphere.  We  have 
found  empirically  that  each  iteration  of  the  optimizer  requires  approx¬ 
imately  3.5p  function  evaluations,  where  p  is  the  dimension  of  x.  Two 
iterations  are  nearly  always  sufficient.  As  the  search  usually  starts  far 
from  the  solution  and  the  solution  does  not  have  to  be  obtained  with  high 
precision,  it  does  not  seem  likely  that  optimization  procedures  that 
estimate  the  Hessian  would  do  better. 
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For  high  dimensionality,  the  computation  is  dominated  by  the 
evaluations  of  the  object  function.  Since  it  is  not  crucial  to  find  the 
precise  optimum,  considerable  savings  are  achieved  by  substituting  a 
similar,  but  much  less  expensive  figure  of  merit  during  the  search  for  a 
new  direction.  For  this  figure  of  merit  not  only  the  previously  found 
directions  but  also  the  corresponding  spline  coefficents  are  held  fixed. 

For  a  given  direction,  the  residuals  are  modelled  by  (basically)  a  moving 
average  smooth  [see  Friedman  and  Stuet2le,  1981]  The  characteristic 
bandwidth  (the  fraction  of  observations  over  which  averaging  takes  place) 
is  taken  to  be  inversely  proportional  to  the  number  of  kr.ots.  The  residual 
sum  of  squares  from  the  smooth  is  the  figure  of  merit  used  for  the  smooth. 
Solving  the  least  squares  problem  for  the  original  figure  of  merit  requires 


0  jn(  S  j  )2 
L  lsmsW  m 

operations,  while  the  new  figure  of  merit  can  be  evaluated  in  roughly  n 

operations  using  updating  formulas  for  the  moving  average.  The  least 

squares  problem  has  to  be  solved  only  once  for  each  iteration  to  determine 

the  new  model  after  a  has  been  found. 

m 

To  solve  the  least  squares  problem,  we  form  the  normal  equations  and 
use  a  pseudo- inverse,  since  the  design  matrix  might  not  have  full  rank. 

The  singularity  which  arises  from  the  inclusion  of  a  constant  term  for 
each  direction  is  remedied  by  simply  dropping  one  column  per  direction 
from  the  design  matrix.  Higher  order  singularities  caused,  for  example, 
by  the  linear  terms  for  three  co-planar  directions,  are  not  explicitly 
taken  care  of,  but  are  handled  by  the  pseudo-inverse. 


Our  knot  placement  procedu <  *  if  motivated  by  the  sequential  nature 
of  the  algorithm.  At  each  iteration,  the  knot  positions  are  required  for 
the  least  squares  fit,  after  the  new  direction  has  been  found.  Our  model 
at  this  point  is  the  spline  fit  of  the  previous  iteration,  plus  the 
moving  average  smooth  along  the  newly  found  direction.  The  knot  placement 
is  based  on  the  residuals  f r^ >  from  this  model.  Multidimensional  structure 
in  these  residuals  due  to  incompleteness  of  the  model  manifests  itself  as 
high  local  variability  in  the  scatterplots  of  r^  against  am  •  .  In 

order  to  preserve  the  ability  of  fitting  this  structure  in  further 
iterations,  it  is  important  to  avoid  accounting  for  it  by  spurious  fits 
along  existing  directions.  For  this  reason  we  place  fewer  knots  in 
regions  of  higher  local  variability.  Since  the  residuals  change,  the 
knots  are  replaced  along  all  directions  at  each  iteration. 

The  knots  along  a  direction  am  are  placed  as  follows:  the  smooth 
described  above  is  applied  to  {(ry^x,)},^  and  the  vanaM1Uy 

v.j  at  e-ch  point  is  taken  to  be  the  average  squared  residual  from  its 
local  linear  fit.  The  Winsorized  local  variabilities  are  defined  by 


f  2v 

if  v.  >  2v 

1  i 

if  Vl  <  -jV 

'iv 

lvi 

otherwise 

(where  v  =  j-  and  then  are  scaled  so  that  2^.^  ^  =  1. 

The  knots  (t  }  are  placed  to  divide  the  line  into  intervals  with  equal 
content  of  : 
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for  each  l. 


^m 


tt 


4.  Procedure  parameters 

The  operation  of  the  procedure  is  controlled  primarily  by  two 
parameters;  these  are  the  number  of  knots  taken  along  each  direction  and 
the  termination  threshold.  Both  parameters  can  be  adjusted  using  graphical 
output  produced  by  the  program.  The  adequacy  of  the  number  of  knots  and 
their  placement  can  be  judged  by  examination  of  the  residuals  from  the 
final  model  plotted  against  each  a  *  x.  A  systematic  pattern  in  any  one 
of  these  plots  indicates  that  either  the  number  of  knots  is  too  small  or 
that  the  knot  placement  algorithm  did  not  perform  well.  Another  indication 
that  the  number  of  knots  might  be  insufficient  is  that  the  procedure 
chooses  nearly  tne  same  direction  twice,  thereby  effectively  doubling  the 
number  of  knots  placed  along  that  direction. 

The  value  set  for  the  termination  threshold  determines  the  number  of 
terms  making  up  the  model.  Various  criteria  can  be  used  to  decide  whether 
a  particular  term  should  be  included.  In  the  case  of  noisy  data,  one  can 
ask  whether  a  term  is  significantly  different  from  zero  (given  all  previous 
terms),  or  whether  the  addition  of  the  term  reduces  the  predictive  mean 
squared  error  of  the  model.  Also,  considerations  outside  the  data  having 
to  do  with  the  problem  setting  can  influence  such  a  decision.  In  order  to 
judge  statistical  significance,  it  is  necessary  to  know,  by  how  much  one 
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would  expect  an  additional  term  to  increase  the  figure  of  merit  if 
there  were  no  structure  in  the  residuals.  This  can  be  estimated  with  a 
permutation  test.  The  residuals  (from  the  previous  terms)  are  randomly 
permuted  among  the  obervations,  thereby  guaranteeing  no  structure  in  the 
(permuted)  data.  MASA  is  applied  to  these  residuals  and  the  increase  in 
figure  of  merit  noted.  This  process  can  be  repeated,  obtaining  a  (null) 
distribution  of  the  figure  of  merit.  Either  formal  or  informal 
hypothesis  testing  techniques  can  then  be  used  to  judge  whether  the 
nonpermuted  figure  of  merit  is  significant. 

The  optimal  number  of  terms  with  respect  to  prediction  error  can  be 
estimated  by  cross  validation.  The  observations  are  randomly  divided 
into  L  (typically  5-10)  subsamples.  Each  of  the  subsamples  are  in  turn 
set  aside  and  the  model  constructed  from  the  remaining  observations. 

Each  observation  is  set  aside  exactly  once.  The  mern  squared  prediction 
error  averaged  over  the  set  aside  observations  is  taken  as  an  estimate  of 
the  model  mean  squared  error.  Such  as  estimate  can  be  made  for  models 
with  differing  numbers  of  terms  and  that  model  minimizing  the  cross 
validated  mean-squared  error  estimate  is  then  selected.  Both  permutation 
tests  and  cross  validation  can  be  implemented  in  a  small  driver  routine 
which  calls  MASA  repeatedly. 

5.  Examples 

In  this  section  we  present  and  discuss  the  results  of  applying  the 
Multidimensional  Spline  Approximation  method  (MASA)  to  four  examples.  (A 
FORTRAN  program  implementing  MASA  is  available  from  the  authors.)  The 
first  three  examples  were  suggested  elsewhere  for  testing  surface 


7 


approximation  procedures.  The  function  in  the  fourth  example  was 
studied  in  connection  with  a  problem  In  mathematical  genetics. 

The  first  example  is  taken  from  Friedman  [1979].  In  this  example 
uniformly  distributed  random  points  {x..  j  1  s  i  s  200}  were  generated  in 
the  six-dimensional  hypercube  [0,1]^.  Associated  with  each  point  x^  was 
a  surface  value 

y.  =  10  sin  (nx.(l)xi(2))  +  20[xi(3)-0.5]2  +  10x.(4)  +  5x.(5)  +  0x1(6)+61, 

where  the  {6^  were  independent  identically  distributed  standard  normal. 

The  inverse  figures  of  merit  for  the  approximation  with  M  =  1,...,4  terms 
were  6.71,  4.29,  1.87,  0.97.  In  three  restarts,  the  figure  of  merit  did 
not  decrease  below  0.86,  so  M  *  4  was  chosen.  The  four  linear  combinations 
and  the  corresponding  spline  functions  are  shown  in  figures  1.1 -1.4. 

(The  function  value  is  plotted  on  the  vertical  axis,  a  ■  x  on  the  horiz¬ 
ontal  axis.  The  "+"  signs  on  the  bottom  of  the  graph  indicate  the  knot 
positions.  A  "+"  sign  followed  by  a  number  indicates  multiple  knots.) 

For  completeness,  the  program  parameters  are  also  listed;  see  comments  in 
the  program  source  code  for  a  detailed  explanation.)  The  spline  along  the 
first  linear  combination  (figure  1.1)  is  seen  to  model  the  linear  part 
of  the  surface.  The  second  term  in  the  approximation  (figure  1.2)  models 
the  additive  quadratic  dependence  on  x(3).  The  final  two  terms  (figures 
1.3,  1.4)  model  the  Interaction  between  x(l)  and  x(2).  The  L2  norm  of  the 
error  |jf  -  was  0.57. 

Although  the  full  advantages  of  MASA  compared  to  other  procedures 
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are  realized  in  higher  dimensional  or  noisy  settings,  we  applied  it  to 
two  bivariate  examples  used  by  Franke  [1979]  to  compare  a  number  of 
interpolatory  surface  approximation  schemes.  For  both  examples  100 
uniformly  distributed  random  points  in  the  unit  square  [0,1]  were 
generated.  The  function  in  Franke' s  first  example  is 

f(x,y)  =  0.75  exp[ — -  ] 

.0.75  „PE-^2  -  Sf$L] 

.0.5  exp[-  3 

+0.2  exp[-(9x-4)2  -  (9y-7)2] 

Considerations  similar  to  those  in  the  previous  example  led  to  an 
approximation  with  three  terms.  The  linear  combinations  and 
corresponding  spline  functions  are  shown  in  figures  2. 1-2. 3. 

The  function  in  Franke' s  second  example  is 

f(x,y)  =  |  [tanh{9y-9x)  +  1]. 

for  this  case  the  approximation  used  only  one  term,  shown  in  figure  3.1. 

Since  different  random  points  were  used  in  Franke' s  and  our  tests, 
precise  comparisons  are  not  possible.  On  the  first  example,  MASA  gave 
roughly  an  order  of  magnitude  larger  errors  than  the  best  methods  in 
Franke's  trials  (global  basis  function  methods)  while  on  the  second 


example,  MASA  gave  an  order  of  magnitude  smaller  errors  than  the  best 

methods.  These  results  are  not  surprising  since  the  peak-shaped  basis 

functions  of  the  global  basis  methods  are  especially  suited  for 

representing  the  peaks  of  the  first  example,  whereas  the  ridge-shaped 

basis  functions  of  MASA  are  especially  suited  to  the  second  example. 

Unfortunately,  peak-shaped  basis  functions  are  not  appropriate  for 

moderate  or  higher  dimensionality.  The  difficulty  is  that  in  order  to 

achieve  a  smooth  fit,  the  width  of  the  basis  peaks  needs  to  be 

comparable  to  the  distance  between  data  points.  For  n  uniformly  distributed 

random  points  in  a  p-dimensional  hypercube  [0,1 ]p,  the  typical  nearest 

1  1 

neighbor  distance  is  (-)p  .  In  particular  for  n  =  1000  and  p  =  10,  this 
distance  is  0.5,  and  for  p  =  20  is  0.7.  Thus  variation  of  the  surface 
over  distances  small  compared  to  such  large  interpoint  distances  cannot 
be  well  approximated  with  these  global  basis  functions  methods. 

Our  final  example  is  a  19-dimensional  function  encountered  by 
Carmel li  and  Cavalli  [1979].  An  important  question  is  the  structure  of 
this  function  near  its  minimum.  We  sampled  the  function  at  200  points 
uniformly  distributed  in  a  small  hypercube  centered  at  the  minimum 
found  by  numerical  optimization  and  applied  MASA.  The  inverse  figure  of 
merit  for  the  best  constant  fit  was  13.3.  The  inverse  figure  of  merit  for 
M  =  1  was  0.78.  In  30  restarts,  the  figure  of  merit  did  not  decrease 
below  0.42.  Figure  4.1  gives  the  linear  combination  and  corresponding 
spline  function.  This  shows  that  most  of  the  structure  in  the 
likel ihood  function  is  revealed  In  this  one  projection.  The  structure 
certainly  would  not  be  easy  to  find  by  just  looking  at  the  definition  of 


the  function,  and  we  know  of  no  other  approximation  method  that  would 
yield  this  kind  of  information. 

6-  Discussion 

MASA  can  be  expected  to  work  well  to  the  extent  that  the  surface  can 
be  approximated  by  a  function  of  the  form  (1).  Of  course  in  the  limit 
M -»  °°  all  smooth  surfaces  can  be  represented  by  (1),  but  even  for  moderate 
M  functions  of  this  form  constitute  a  rich  class. 

As  seen  in  the  previous  section,  an  advantage  of  using  essentially 
one-dimensional  basis  functions  is  the  possibility  of  graphical 
interpretation.  The  entire  model  can  be  represented  by  graphing 
sm(am  •  x)  against  am  •  x  and  by  specifying  (perhaps  graphically 

for  p  =  2  or  3).  Additionally  the  graphical  output  is  very  helpful  for 
setting  the  main  procedure  parameters,  the  number  of  knots  along  each 
direction  and  the  termination  threshold.  Proper  termination  of  the 
algorithm  can  be  assured  by  monitoring  at  each  iteration  the  plot  of  the 
residuals  from  the  model  of  the  previous  iteration  along  the  newly  found 
direction. 

The  problem  of  sparse  sampling  in  high  dimensions  is  not  encountered, 
since  MASA  is  fitting  one-dimensional  projections  of  the  entire  sample. 

The  sample  need  not  lie  on  a  regular  grid,  and  the  approximation  is  affine 
invariant  and  smooth.  Function  values,  derivatives,  and  integrals  are 

inexpensive  to  evaluate.  In  addition,  since  the  approximation  is  locally 
quadratic  for  q  =  3,  optimization  algorithms  can  be  expected  to  converge 
rapidly.  As  only  the  directions,  the  knot  positions  and  the  B-spline 


r 


n 

coefficients  have  to  be  stored,  MASA  produces  a  very  parsimonious 
description  of  the  surface. 
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SPLINE  FUNCTION  AN D  KNOTS  ALONG  DIRECTION  NR  1 
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